Vamos a llevar a cabo un enriquecimiento funcional con diferentes paquetes y
bases de datos de los genes anotados con los paquetes de anotación (annotatr).
Un posible diseño será el siguiente:
Además, voy a probar otros programas además de la aproximación clásica con
clusterProfiler.
Link paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4840234/
We analyzed eight ChIP-seq data sets from a range of human and mouse cells and tissues (Supplementary Table 5), each with a different distribution of proximal and distal binding events (Fig. 2a). We tested each data set in six different ways: (i) by reproducing the original study’s list of enrichments, or if the original study did not report enrichments, by using DAVID on the set of genes with binding events within 2 kb of the transcription start site; (ii) by using GREAT with the default regulatory domain definition (basal promoter 5+1 kb and extension up to 1 Mb); (iii) by using GREAT’s hypergeometric test on the set of genes with binding events within 2 kb of the transcription start site, to control for the different gene mappings and ontologies in DAVID and GREAT; (iv) by using GREAT with a 5+1 kb basal promoter and a more limited 50 kb extension; and (v, vi) by using GREAT with either one (v) or two (vi) nearest genes up to 1 Mb (Tables 2 and and3,3, and Supplementary Tables 6–44, indexed in Supplementary Table 45).
clusterProfilerLink vignette: https://yulab-smu.github.io/clusterProfiler-book/index.html
Lo mejor va a ser mirar la vignette, ya que hay bastantes posibilidades. Respecto a cosas interesantes, podemos:
ChIPseeker, puede ser interesante.dataAnnotatr <- read.csv(file.path(dataPath, dirO.7, "annotatr_results_E1_0.7.csv"))
allGenes <- na.omit(unique(dataAnnotatr$annot.symbol))
allGenesDF <- bitr(allGenes, fromType = "SYMBOL",
toType = c("ENSEMBL", "ENTREZID"),
OrgDb = org.Hs.eg.db)
clusterProfilerAquí podríamos coger como universo otra cosa: para ser más específicos, quizás podría coger genes presentes en alguna línea celular stem y ver qué genes están enriquecidos en la nuestra.
ggo1 <- sapply(c("MF", "CC", "BP"), function(term) {
enrichGO(unique(allGenesDF$ENTREZID), org.Hs.eg.db,
ont = term, qvalueCutoff = 0.05,
readable = TRUE)})
Respecto al warning que devuelve, parece que es algo generalizado y que ya les han comentado en git: https://github.com/YuLab-SMU/enrichplot/issues/22.
dotplot(ggo1$MF, showCategory = 20, font.size = 8)
barplot(ggo1$MF, showCategory = 20, font.size = 8)
upsetplot(ggo1$MF)
# goplot(ggo1)
Tabla resultados
DT::datatable(ggo1$MF@result)
ggo1MFsimplified <- simplify(ggo1$MF)
dotplot(ggo1MFsimplified, showCategory = 20, font.size = 8)
barplot(ggo1MFsimplified, showCategory = 20, font.size = 8)
Link datos: http://bio-bigdata.hrbmu.edu.cn/CellMarker
dataHumanCells <- read.delim(file.path(analysisPath, "/data/Human_cell_markers.txt"))
dataHumanCells <- dataHumanCells %>% tidyr::unite("cellMarker", tissueType,
cancerType, cellName, sep=", ") %>%
dplyr::select(cellMarker, geneID) %>%
dplyr::mutate(geneID = strsplit(geneID, ', '))
Mediante la función enricher podemos ver el enriquecimiento de genesets
propios en nuestro conjunto de genes. Es la msima base que con las anotaciones
GO, pero aquí estamos construyendo nosotros la base de datos.
typeCells <- enricher(unique(allGenesDF$ENTREZID), TERM2GENE = dataHumanCells,
pAdjustMethod = "fdr", minGSSize = 5, qvalueCutoff = 0.2)
dotplot(typeCells, showCategory = 10, font.size = 8)
barplot(typeCells, "GeneRatio", showCategory = 10, font.size = 8)
upsetplot(typeCells)
DT::datatable(typeCells@result)
Subset con células sangre
Estos datos, aunque no los usemos para estos análisis, son una posibilidad para sacar genes que se suponen específicos de monocitos, al menos es algo de donde podemos sacar información.
dataBloodCells <- read.delim(file.path(analysisPath, "/data/Human_cell_markers.txt"))
monocytesInfo <- unique(dataBloodCells[grepl(".*([mM]onocyte)", dataBloodCells$cellName),]$tissueType)
dataBloodCells <- dataBloodCells %>% filter(tissueType %in% monocytesInfo) %>%
tidyr::unite("cellMarker", tissueType, cancerType, cellName, sep=", ") %>%
dplyr::select(cellMarker, geneID) %>%
dplyr::mutate(geneID = strsplit(geneID, ', '))
Con el subset no es capaz de sacar genesets significativos, ya que probablemente los genes que ocmparten sean muy parecidos entre todos los grupos.
typeCellsBlood <- enricher(unique(allGenesDF$ENTREZID), TERM2GENE = dataBloodCells,
pAdjustMethod = "fdr", minGSSize = 5, qvalueCutoff = 0.2)
dotplot(typeCellsBlood, showCategory = 10, font.size = 8)
barplot(typeCellsBlood, "GeneRatio", showCategory = 10, font.size = 8)
DT::datatable(typeCellsBlood@result)
Es la base de datos de GSEA, tienen un apartado específico para células inmunes. Está bien que las entradas salgan relacionadas con monocitos, ya que estamos sesgando mucho el análisis por hacer enriquecimiento en una base de datos donde solo hay células inmunes.
Información datos:
Immunologic signatures collection (also called ImmuneSigDB) is composed of gene sets that represent cell types, states, and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology.
We first captured relevant microarray datasets published in the immunology literature that have raw data deposited to Gene Expression Omnibus (GEO). For each published study, the relevant comparisons were identified (e.g. WT vs. KO; pre- vs. post-treatment etc.) and brief, biologically meaningful descriptions were created. All data was processed and normalized the same way to identify the gene sets, which correspond to the top or bottom genes (FDR < 0.02 or maximum of 200 genes) ranked by mutual information for each assigned comparison.
Link paper: https://www.cell.com/immunity/fulltext/S1074-7613(15)00532-4
geneImmSginature <- msigdbr(species = "Homo sapiens", category = "C7") %>%
dplyr::select(gs_name, entrez_gene)
immCells <- enricher(unique(allGenesDF$ENTREZID), TERM2GENE = geneImmSginature)
dotplot(immCells, showCategory = 10, font.size = 6)
barplot(immCells, "GeneRatio", showCategory = 10, font.size = 6)
upsetplot(immCells)
dso1 <- enrichDGN(allGenesDF$ENTREZID, readable = TRUE)
dotplot(dso1, showCategory = 20, font.size = 8)
barplot(dso1, showCategory = 20, font.size = 8)
upsetplot(dso1)
do1 <- enrichDO(allGenesDF$ENTREZID,
readable = TRUE)
dotplot(do1, showCategory = 20, font.size = 8)
barplot(do1, showCategory = 20, font.size = 8)
upsetplot(do1)
kegg1 <- enrichKEGG(gene = allGenesDF$ENTREZID,
organism = "hsa")
dotplot(kegg1, showCategory = 20, font.size = 8)
barplot(kegg1, showCategory = 20, font.size = 8)
upsetplot(kegg1)
pho_KEGGresult <- find_enriched_pathway(unique(allGenesDF$ENTREZID),
species = 'hsa')
DT::datatable(pho_KEGGresult[[1]])
genesPromoter <- dataAnnotatr %>% filter(grepl("^(promoter).*", annot.id),
!is.na(annot.symbol))
genesPromoter <- unique(genesPromoter$annot.symbol)
genesPromoterDF <- bitr(genesPromoter, fromType = "SYMBOL",
toType = c("ENSEMBL", "ENTREZID"),
OrgDb = org.Hs.eg.db)
clusterProfilerAquí podríamos coger como universo otra cosa, pero creo que lo lógico es utilizar todos los genes.
ggo2 <- sapply(c(c("MF", "CC", "BP")), function(x) {
enrichGO(unique(genesPromoterDF$ENTREZID), org.Hs.eg.db,
ont = "MF", qvalueCutoff = 0.05,
readable = TRUE)}
)
dotplot(ggo2$MF, showCategory = 20, font.size = 8)
barplot(ggo2$MF, showCategory = 20, font.size = 8)
upsetplot(ggo2$MF)
goplot(ggo2$MF)
DT::datatable(ggo2$MF@result)
Para evitar la redundancia de los términos, el paquete te permite limpiar entradas redundantes. Pasa de 1014 a 91 términos enriquecidos, ojo jaja. En cualquier caso siguen siendo bastante inespecíficos.
ggo2MFsimplified <- simplify(ggo2$MF)
dotplot(ggo2MFsimplified, showCategory = 20, font.size = 8)
barplot(ggo2MFsimplified, showCategory = 20, font.size = 8)
Te permite clusterizar genes en función de similitudes en sus términos GO. Te
permite hacerlo con una lista de genes, por clústeres, etc. Además, mide
la similitud entre dos términos GO, siendo de hecho esta funcionalidad la que
implementa el método simplify de clusterProfiler. Esto podría ser útil
en transcriptómica para hacer un GSEA si no tenemos una hipótesis previa (si no
tenemos un geneset previo).
Ejemplo
Se podría hacer con los SYMBOL para hacerlo más legible. Utilizo 20 genes porque revienta metiéndole todos.
hsGO <- godata('org.Hs.eg.db', ont="MF")
distan <- mgeneSim(head(genesPromoterDF$ENTREZID, n = 20), semData = hsGO,
measure = "Wang", combine = "BMA")
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================= | 24%
|
|================= | 25%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 60%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
pheatmap::pheatmap(distan)
Curiosamente, al llevar a cabo el mismo análisis que el anterior con únicamente los genes anotados en los promotores, solo aparecen dos tipos celulares con un q-value significativo: células embrionarias. Esto significa que los genes que no están anotados en los promotores son importantes, de hecho GREAT no coge solo esos genes, sino los que están a X kb de las anotaciones. Pero bueno, no rayarse, los genesets que utilizamos no tienen por qué ser específicos ni nada.
Cambiar función para que plotee bien los resultados
https://rdrr.io/bioc/enrichplot/src/R/dotplot.R
typeCellsPromoters <- enricher(unique(genesPromoterDF$ENTREZID), TERM2GENE = dataHumanCells,
pAdjustMethod = "fdr", minGSSize = 5, qvalueCutoff = 0.2)
dotplot(typeCellsPromoters, showCategory = 50, font.size = 8)
barplot(typeCellsPromoters, showCategory = 50, font.size = 8)
DT::datatable(typeCellsPromoters@result)
dso2 <- enrichDGN(genesPromoterDF$ENTREZID,
readable = TRUE)
dotplot(dso2, showCategory = 20, font.size = 8)
barplot(dso2, showCategory = 20, font.size = 8)
upsetplot(dso2)
enrichplot::pmcplot(head(dso2$Description), 2012:2019)
if (file.exists(file.path(outputPath, "dm_annotated.rds")) &
file.exists(file.path(outputPath, "dm_random_annotated.rds"))){
dm_annotated <- readRDS(file.path(outputPath, "dm_annotated.rds"))
dm_random_annotated <- readRDS(file.path(outputPath, "dm_random_annotated.rds"))
df_dm_annotated <- data.frame(dm_annotated)
} else {
regionsInter <- read_regions(con = file.path(analysisPath, "data/intersect_iPScells/solo_monocitos.bed"),
genome = 'hg19',
format = 'bed')
annots <- c('hg19_genes_cds', 'hg19_basicgenes', 'hg19_cpgs', 'hg19_genes_intergenic',
'hg19_enhancers_fantom')
annotations <- build_annotations(genome = 'hg19', annotations = annots)
# Intersect the regions we read in with the annotations
dm_annotated <- annotate_regions(regions = regionsInter,
annotations = annotations,
minoverlap = 100L, #Overlap of annotations, f=-0.5
ignore.strand = TRUE,
quiet = FALSE)
df_dm_annotated <- data.frame(dm_annotated)
# Randomize the input regions
dm_random_regions <- randomize_regions(regions = regionsInter,
allow.overlaps = TRUE,
per.chromosome = TRUE)
# Annotate the random regions using the same annotations as above
# These will be used in later functions
dm_random_annotated <- annotate_regions(regions = dm_random_regions,
annotations = annotations,
minoverlap = 100L, #Overlap of annotations, f=-0.5
ignore.strand = TRUE, quiet = TRUE)
saveRDS(dm_annotated, file.path(outputPath, "dm_annotated.rds"))
saveRDS(dm_random_annotated, file.path(outputPath, "dm_random_annotated.rds"))
}
# Find the number of regions per annotation type
dm_annsum <- summarize_annotations(annotated_regions = dm_annotated, quiet = TRUE)
print(dm_annsum)
# A tibble: 13 x 2
annot.type n
<chr> <int>
1 hg19_cpg_inter 3408
2 hg19_cpg_islands 1722
3 hg19_cpg_shelves 277
4 hg19_cpg_shores 6401
5 hg19_enhancers_fantom 748
6 hg19_genes_1to5kb 2528
7 hg19_genes_3UTRs 238
8 hg19_genes_5UTRs 774
9 hg19_genes_cds 490
10 hg19_genes_exons 1813
11 hg19_genes_intergenic 840
12 hg19_genes_introns 8316
13 hg19_genes_promoters 3163
# Count the occurrences of classifications in the Status
# column across the annotation types.
dm_catsum <- summarize_categorical(annotated_regions = dm_annotated,
quiet = TRUE)
print(dm_catsum)
# A tibble: 30,230 x 3
# Groups: annot.type [13]
annot.type annot.id n
<chr> <chr> <int>
1 hg19_cpg_inter inter:10008 3
2 hg19_cpg_inter inter:10011 3
3 hg19_cpg_inter inter:10016 2
4 hg19_cpg_inter inter:10017 1
5 hg19_cpg_inter inter:10026 1
6 hg19_cpg_inter inter:10032 1
7 hg19_cpg_inter inter:10034 5
8 hg19_cpg_inter inter:10048 4
9 hg19_cpg_inter inter:10087 3
10 hg19_cpg_inter inter:10140 1
# … with 30,220 more rows
#View a heatmap of regions occurring in pairs of annotations
annots_order <- c('hg19_genes_promoters', 'hg19_genes_5UTRs',
'hg19_genes_exons', 'hg19_genes_introns',
'hg19_genes_3UTRs', 'hg19_genes_cds',
'hg19_enhancers_fantom')
dm_vs_coannotations <- plot_coannotations(annotated_regions = dm_annotated,
annotation_order = annots_order,
axes_label = 'Annotations',
plot_title = 'Regions in Pairs of Annotations')
print(dm_vs_coannotations)
dm_annotations_plot <- plot_annotation(annotated_regions = dm_annotated,
annotation_order = annots_order,
plot_title = 'Annotations for monocytes-iPCs',
x_label = 'Annotation type',
y_label = 'Count')
print(dm_annotations_plot)
# View the number of regions per annotation and include the annotation
# of randomized regions
dm_annotations_plot_wrandom = plot_annotation(annotated_regions = dm_annotated,
annotated_random = dm_random_annotated,
annotation_order = annots_order,
plot_title = 'Annotations for monocytes-iPC (with rndm.)',
x_label = 'Annotation type',
y_label = 'Count')
print(dm_annotations_plot_wrandom)
#CpGIslands
annots_order = c('hg19_cpg_islands', 'hg19_cpg_shores',
'hg19_cpg_shelves', 'hg19_cpg_inter')
dm_annotations_plot_wrandom = plot_annotation(annotated_regions = dm_annotated,
annotated_random = dm_random_annotated,
annotation_order = annots_order,
plot_title = 'Annotations for monocytes-iPC (with rndm.)',
x_label = 'Annotation type',
y_label = 'Count')
print(dm_annotations_plot_wrandom)
genesMonoIPC <- data.frame(ENTREZID = df_dm_annotated$annot.gene_id,
SYMBOL = df_dm_annotated$annot.symbol)
intersectGenes <- intersect(genesMonoIPC$ENTREZID, allGenesDF$ENTREZID)
cat(">> Número de genes en E1:", length(unique(allGenesDF$ENTREZID)))
>> Número de genes en E1: 10349
cat("\n>> Número de genes en Mono-iPCs:", length(unique(genesMonoIPC$ENTREZID)))
>> Número de genes en Mono-iPCs: 4932
cat("\n>> Número de genes coincidentes:", length(intersectGenes))
>> Número de genes coincidentes: 4706
ggo3 <- sapply(c("MF", "CC", "BP"), function(term) {
enrichGO(unique(genesMonoIPC$ENTREZID), org.Hs.eg.db,
ont = term, qvalueCutoff = 0.05,
readable = TRUE)})
dotplot(ggo3$MF, showCategory = 20, font.size = 8)
barplot(ggo3$MF, showCategory = 20, font.size = 8)
upsetplot(ggo3$MF)
# goplot(ggo1)
kegg3 <- enrichKEGG(unique(genesMonoIPC$ENTREZID),
organism = "hsa")
dotplot(kegg3, showCategory = 20, font.size = 8)
barplot(kegg3, showCategory = 20, font.size = 8)
upsetplot(kegg3)